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Considered in the paper is the problem of selecting a diagnostic biomarker that has the 
highest classification rate among several candidate markers with dichotomous outcomes. 
The probability of correct selection depends on a number of nuisance parameters from 
the joint distribution of the biomarkers and thus can be substantially affected if these 
nuisance parameters are misspecified. A two-stage procedure is proposed to compute the 
needed sample size that achieves the desired level of correct selection, as so confirmed by 
simulation results. 
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1. Introduction 

In the area of diagnostic medicine, a biomarker is used to identify subjects that 
have a certain condition, e.g. a disease or risk status for the disease, so that 
proper treatments can be followed (Zhou et al. 2002). Often, multiple candidate 
biomarkers are developed simultaneously and medical investigators are 
interested in selecting the one with the highest diagnostic accuracy among 
these newly identified biomarkers so that future studies can be focused on the 
selected biomarker. 

One approach to achieving this goal is as follows. A number of subjects are 
randomly selected from a population with the condition and one without the 
condition, hereafter referred to as 'diseased' and 'non-diseased' populations, 
respectively. The candidate biomarkers are then tested on each subject and their 
diagnostic accuracy is evaluated from the test results. The biomarker that yields 
the highest (estimated) accuracy will then be identified as the best biomarker. 
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A key feature associated with such a selection process is the probabihty of 
correct selection (PCS), that is, the probability that the selected biomarker 
indeed has the highest diagnostic accuracy. The number of diseased and non- 
diseased subjects needs to be sufficiently large so that the PCS can be maintained 
at a certain desired level. Since the biomarkers are tested on the same subject, 
their test outcomes from the same subject arc correlated, following a multivariate 
distribution. Thus, the PCS depends on a number of parameters from the 
joint distribution which have to be specified in order to have an estimate of the 
sample size. Unfortunately, however, the PCS can be substantially adversely 
affected if these parameters are not correctly quantified. 

To relax the dependence of the PCS on these parameters, we propose a two- 
stage procedure. First, we randomly select a number of diseased and non- 
diseased subjects and use the test results from these subjects to estimate the 
corresponding parameters in the PCS function. Then, the total number of 
diseased and non-discascd subjects is computed using these estimates that 
achieve the desired PCS level when the diagnostic accuracy of the best biomarker 
differs from the other biomarkers by a specified amount. Simulation results are 
presented to demonstrate the effectiveness of the proposed procedure in 
achieving the PCS. This paper ends with some discussions in §6. 



2. Formulation of the problem 

Suppose there are K candidate biomarkers under consideration, and each yields a 
binary test result on a subject, 1 or 0, representing the subject being classified as 
diseased or non-diseased, respectively, by the biomarker. To assess their diagnostic 
accuracy, these biomarkers are tested on a random sample of m diseased and n non- 
diseased subjects. For the A;th biomarker, its test result is denoted by Xg, from the 
zth diseased subject and Yjj. from the jth non-diseased subject. With these 
notations, the sensitivity (the probability of correctly classifying a diseased subject) 
of the kth biomarker is ' P{Xi^. = 1} and can be estimated by p^. = Xj^/ m. 
Its specificity (the probability of correctly classifying a non-diseased subject) is 
P{ Yjf. = 0} = 1 — with q^. — P{ =1} and can be estimated by \ — qj., with 
qj^ = "^^^=1 ^jkl'^- The classification rate 6}. of the fcth biomarker is the sum of the 
two quantities, i.e. 0^ — p^. — qk + 1 — P{Xii; — 1} — P{ Yjf. = 1} -|- 1, and can be 
estimated subsequently by dk— {Yl%i ^ik)/''^ ~ {Ylj'=i ^jk)/''^'^ ^- The best 
biomarker is the one with the largest 6, and the one that yields the largest 
^-estimate will be chosen as the 'best' biomarker. Selection error occurs when the 
largest ^-estimate is not from the biomarker with the largest d. 

Without loss of generality, we assume that the first biomarker is the best 
among the candidate markers, i.e. — maX(t>i^fc. The lib. biomarker is selected 
as the best if di = maxj.>i^;t. Therefore, the PCS is 

= max 01. 

k>i 



P<^ di = ra&xdf. 



The key issue is to determine m and n so that the PCS will be at least y, a 
specified level of probability, under certain parametric configuration of the ^s. 
Usually we require that p>y when — maxj5.>2^i: > ^, where A>0 is also a 
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prespecified constant. For sufficiently large m and n, di^{k— 1, ...,K) follows 
asymptotically a multivariate normal distribution. Thus if p — y for 
^1 — maXfc>2^A: = ^, then p'>j for — maxi.>2i9j > Zl. Therefore, the sample 
sizes m and n are computed by solving the equation 



P<^ di — max 6}. 



In order to compute the sample sizes, parameters other than A which appear in 
the equation are given values based on 'educational guess' or other data sources 
if possible. 



3. Computation of PCS 

The test outcomes of the K biomarkers from a diseased or non-diseased subject 
follow a multinomial distribution. Exact calculation of the PCS involves the joint 
probabilities of {X,i = 0 or 1, X^k—^ or 1} and { ^,1 = 0 or 1, Yj^—O or 1}, 
which becomes extremely complicated when K is relatively large. Instead, 
we may use normal approximation to relax the computational complexity, 
assuming that the sample sizes are relatively large. 

Write 6— {di, ...,6x) and (9= (di, ...,dj^). Then by classical asymptotic theory 
(Bickel & Doksum 1977), 6 follows asymptotically a multivariate normal 
distribution with mean vector 6 and a variance-covariance matrix given by 

var(^fc) = — - Pi-) + - - Qk) 
m n 

and 

cov(4, di) = — {pu - PkPi) + -{qk.i - quQi), 
m n 

where Pk,i = P{Xik = = l},^^^ = P{ Y^^ = Yii = !}• 

Define ~ ^A+i ~ ^''^^ ^k — ^k+i~^ii foi" k—\,...^K — \^ and write 6= 
{6i^ and 6 — {5i^ ...,6x-i)- Then 5 is also normally distributed with 

mean vector 8 and variance-covariance matrix 2= {(Ju)i where 

(^kk = var(4) = — {Pfc+i +Pi -2pi,fc+i -(pi - Pk+if] 
n 

and 

^ki = — {PA+i,m - - V\,i+\ +Pi -bi - Vk+\){V\ - Pm)} 

TO ^ ' 

+ - - + q\-{i\ - qk+i){qi - qi+i)}- 

Let F{xi, Xx-i] (5, ^) be the multivariate normal distribution function with 
mean 5 and variance-covariance matrix 2, then the PCS is approximately 

p(<5,i:) =P{4<0,A; = l,...,if-l} =F(0,...,0;6,i:). (3.1) 
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The sample sizes m and n are computed from the equation F(0, ...,0;6,S) — y 
under the constraints that d^—A for aU k— 1, ...,K — 1. 



4. A two-stage procedure 

The asymptotic formula (3.1) of the PCS involves K{K+1) parameters 
{Pki Qk^ Pk,b Qk,i '■ ^ f^T^'^ ^ f^} from the multinomial distributions. With 
the K—1 constraints d^—A, there are distributional parameters (referred 

to as nuisance parameters) left to be specified in order to solve for sample sizes, 
assuming that m/n — X is a known constant. If these nuisance parameters are 
misspecified, then the PCS can be substantially adversely affected, resulting in 
an unacceptably high level of selection error; see §5 for numerical demonstration. 
In order to maintain the desired level of the PCS, we propose, following Stein's 
(1945) idea, a two-stage selection procedure to compute the required sample size 
N— m + n—{\ + X)n with ratio X—m/n fixed. 

Step 1. Randomly select rui diseased and Ui non-diseased subjects from the 
target populations and test the biomarkers on these subjects. Use the test 
outcomes X,;fe, i=l, rui and Y^x-, J=l; ■■■) to obtain empirical estimates of 
the nuisance parameters, given as follows: 



Pk 



1 '"i 1 "1 

— Vx,,, (4.1 



^1=1 ^ 7=1 



1 "'1 1 "1 

= — X,,X,u Qkj = — Y^k Yfl. (4.2) 



Pk.i 

nil 



1=1 ■"^ j=i 



Step 2. Plug these estimates into the PCS formula and then solve for sample 
size N' from the equation p — y. The required sample size N is then set to 
max(iVi, iV'), with Ni — rrii + ni. 

The sample size A'^ computed in this way is no longer a fixed quantity. Rather, 
it is a random variable, a statistic of the data from the first stage. Note that the 
estimates in (4.1) and (4.2) are consistent estimators of the corresponding 
parameters. For sufficient sample size A^i, the PCS is expected to be 
approximately y regardless of the true values of the nuisance parameters; see 
simulation results below. 



5. Numerical evaluation 

Numerical studies are conducted to investigate the effects of the nuisance 
parameters on PCS and the performance of the proposed two-stage procedure in 
maintaining the level of PCS. To reduce the computational burden, we use three 
biomarkers {K—2>) for illustration. 

Table 1 presents the values of PCS under various configurations of the 
parameter space which determine the joint distribution of the three 
biomarkers, that is, the joint probability of p^^], — P{Xii — i,Xi2— j,Xi^ — k} 
and q^i^ — P{ Yn — i, Y12 — j, Yi^ — k}, for i, j, k—0, 1. These joint probabilities 
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thus give the marginal probabihties, for example, 

Pi,2 = P{Xn = 1,^12 = 1} = Pni +Pno 
Pi = P{Xn = 1} = Pin + Pno + Pwi + Pwo, 

and so on. 

All configurations in table 1 correspond to a fixed value of 0.1 for 
A {—01 — 62 — di — 6^). The first row in the table specifies the parameters 
used to compute the sample size, for a single-stage procedure, assuming that 
X—m/n—l. It turns out that, with this configuration, 102 diseased and 102 non- 
diseased subjects are needed to achieve a PCS of at least 0.95. 

The p- values, computed using formula (3.1), in the remaining rows of table 1 
are the PCS with 102 diseased and 102 non-diseased subjects under other 
specifications of the parameter values. We see that the PCS can be substantially 
lower than the desired level of 0.95 if the parameters differ from the specified 
values in the first row of the table. For example, if the corresponding parameters 
from the joint distribution of the three biomarkers take values from the second 
row rather than as being specified in the first row of the table, then the PCS can 
be decreased by more than 20%! 

To investigate whether the proposed two-stage procedure can relax the 
dependence on nuisance parameters and maintain the PCS at the desired level, 
we conducted 10 000 simulations for each configuration of the parameter space in 
table 1. The simulated values of PCS are presented as p* in the last column of the 
table. For the first-stage sample size, we used mi — 75 diseased and ni — 75 non- 
diseased subjects to estimate the corresponding nuisance parameters. 

The results in table 1 clearly show that the PCS is maintained at the desired 
level of 0.95, regardless of the parametric configuration under which the data are 
generated. Thus the two-stage procedure is proved to be efficient in controlling 
selection error rates. Numerical results (not shown here) for other configurations 
and ratios of A reveal similar findings. With such a multidimensional parameter 
space, however, it is difficult to characterize any configurations for which the 
two-stage procedure would perform relatively better. 



6. Discussion 

In this paper, we proposed a two-stage procedure to select the best diagnostic 
biomarker among a number of candidate markers, and numerically demonstrated 
its effectiveness in controlling the selection error. While there is no universal 
standard on how the first-stage sample size is determined, it should be relatively 
large so that accurate estimates of the nuisance parameters can be obtained. On 
the other hand, it should not be too large when taking the cost effectiveness of a 
study into consideration. 

We confine our attention to biomarkers with dichotomous outcomes. However, 
the two-stage approach proposed in this paper can be easily extended to 
biomarkers with ordinal or continuous test results. In these two cases, an 
appealing selection criterion is to select the marker that has the largest area 
under its receiver operating characteristic curve (Hanley 1989). 
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Various selection problems and related procedures have been discussed 
extensively in the statistical literature (see Bechhofer et al. 1968; Gibbons 1988), 
but mostly focusing on multivariate normal distributions. Applications of the 
selection concepts and theory to the area of diagnostic medicine are relatively 
new, and many other issues remain to be solved such as selecting a fixed number 
of best diagnostic biomarkers and selecting the best biomarker (or all 
biomarkers) that are better than a standard biomarker. Further research efforts 
are much needed along these lines. 

Throughout this paper we assume that the data will be available for all 
biomarkers from each subject. Very often values on some biomarkers may be 
missing. One possible approach to dealing with this situation is to use multiple 
imputation based on the multinomial distributions and then apply the proposed 
method to the imputed data. The effectiveness of this approach in controlling the 
PCS needs further investigation. Moreover, selection bias may occur when the 
subjects are not randomly selected, and correction for such bias in estimating the 
PCS is desirable. 
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